Applied Computational Intelligence - HW1

Authors

The goal is to gain insight into a dataset by means of summary statistics and visualisations. For this exercise we chose a Kaggle Breast Cancer Dataset, available here.

0. Packages and General Options

1. Describe your data

  1. Number of observations $N$
  2. Number of predictor variables $D$
  3. Number of classes $L$
  4. Class-distribution (that is, the number of observations for each of the classes)

1.1 - 1.4
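
A minimal sketch of how the four quantities above could be computed with pandas. The small table and its column names (`radius`, `texture`, `label`) are hypothetical stand-ins for the real Kaggle file, which is not reproduced here.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names and values are made up.
df = pd.DataFrame({
    "radius":  [12.1, 20.3, 11.5, 18.7, 13.0, 19.9],
    "texture": [17.2, 24.8, 15.9, 22.1, 18.4, 25.0],
    "label":   ["B", "M", "B", "M", "B", "B"],
})

X = df.drop(columns="label")   # predictor columns
y = df["label"]                # class labels

N, D = X.shape                                # observations, predictors
L = y.nunique()                               # number of distinct classes
class_counts = y.value_counts().to_dict()     # observations per class

print(f"N={N}, D={D}, L={L}, class distribution={class_counts}")
```

With the real data, `df` would instead come from `pd.read_csv` on the downloaded file, and the label column name would need to match that file.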

2. Unconditional mono-variate analysis

  1. Plot the (unconditional) histogram of each of the $D$ predictors
  2. Calculate the (unconditional) mean $(\mu_d)$, standard deviation $(\sigma_d)$ and skewness $(\gamma_d)$ of each of the $D$ predictors

2.1
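
One possible way to draw one histogram per predictor with matplotlib is sketched below. The data is synthetic (standard normal draws) and the predictor names `x0`, `x1`, ... are placeholders, not the columns of the actual dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # synthetic stand-in: N=200, D=4
names = [f"x{d}" for d in range(X.shape[1])]     # hypothetical predictor names

# One panel per predictor, all observations pooled regardless of class.
fig, axes = plt.subplots(1, X.shape[1], figsize=(12, 3))
for d, ax in enumerate(axes):
    ax.hist(X[:, d], bins=20)
    ax.set_title(names[d])
    ax.set_xlabel("value")
axes[0].set_ylabel("count")
fig.tight_layout()
```

For a dataset with many predictors, a grid with several rows is likely to be more readable than a single row of panels.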

2.2
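
The three statistics can be sketched with NumPy alone. This version assumes the population (biased) definitions, i.e. $\sigma_d$ is computed with a $1/N$ factor and $\gamma_d = \frac{1}{N}\sum_n (x_{nd} - \mu_d)^3 / \sigma_d^3$. Library routines such as `pandas.DataFrame.skew` apply a bias correction and will give slightly different values. The function name `summary_stats` is illustrative.

```python
import numpy as np

def summary_stats(X):
    """Return per-column mean, standard deviation and skewness of a 2-D array."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)                              # ddof=0 (assumption)
    gamma = ((X - mu) ** 3).mean(axis=0) / sigma ** 3  # third standardised moment
    return mu, sigma, gamma

# Tiny example: first column is symmetric, second has a long right tail.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0],
              [4.0, 10.0]])
mu, sigma, gamma = summary_stats(X)
print(mu, sigma, gamma)
```

The symmetric column has zero skewness, while the column with a single large value has positive skewness.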

3. Class-conditional mono-variate analysis of each of the predictors

  1. Plot the (class-conditional) histogram of each of the $D$ predictors
  2. Calculate the (class-conditional) mean $(\mu_{d|l})$, standard deviation $(\sigma_{d|l})$ and skewness $(\gamma_{d|l})$ of each of the $D$ predictors

3.1
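
A sketch of class-conditional histograms: for each predictor, the observations of each class are overlaid in the same panel using shared bin edges so the bars are comparable. The two-class synthetic data below (class 1 shifted by 2) is an assumption made purely for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)                             # synthetic labels
X = rng.normal(loc=y[:, None] * 2.0, size=(200, 3))    # class 1 shifted by +2

classes = np.unique(y)
fig, axes = plt.subplots(1, X.shape[1], figsize=(10, 3))
for d, ax in enumerate(axes):
    # Same bin edges for every class so the histograms line up.
    bins = np.histogram_bin_edges(X[:, d], bins=20)
    for c in classes:
        ax.hist(X[y == c, d], bins=bins, alpha=0.5, label=f"class {c}")
    ax.set_title(f"x{d}")
axes[0].legend()
fig.tight_layout()
```

Predictors whose class histograms overlap little are candidates for being individually discriminative.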

Data Pre-Processing: Z-Score Normalization
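
Z-score normalisation rescales each predictor to zero mean and unit standard deviation, $z_{nd} = (x_{nd} - \mu_d) / \sigma_d$, so that predictors measured on different scales become comparable. A minimal NumPy sketch follows. The guard for constant columns is an added assumption, not something the assignment specifies.

```python
import numpy as np

def zscore(X):
    """Standardise each column of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # constant columns map to zeros
    return (X - mu) / sigma

# Two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = zscore(X)
print(Z)
```

After the transform both columns hold the same values, since they differ only by a scale factor.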

3.2
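
The class-conditional statistics are the same three quantities computed separately on the rows belonging to each class. A sketch, again assuming the population (biased) definitions, with the hypothetical helper name `class_conditional_stats`:

```python
import numpy as np

def class_conditional_stats(X, y):
    """Map each class label to (mean, std, skewness) arrays over the columns of X."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]                     # rows belonging to class c
        mu = Xc.mean(axis=0)
        sigma = Xc.std(axis=0)             # ddof=0 (assumption)
        gamma = ((Xc - mu) ** 3).mean(axis=0) / sigma ** 3
        stats[int(c)] = (mu, sigma, gamma)
    return stats

# One predictor, two classes with three observations each.
X = np.array([[1.0], [2.0], [3.0], [10.0], [10.0], [13.0]])
y = np.array([0, 0, 0, 1, 1, 1])
stats = class_conditional_stats(X, y)
for c, (mu, sigma, gamma) in stats.items():
    print(c, mu, sigma, gamma)
```

The `int(c)` conversion assumes integer-coded labels. For string labels the key could be kept as `c` itself.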

4. Bi-variate analysis of the predictors

  1. Plot the scatter plots between all pairs of predictors (Investigate the existence of potential relationships between pairs of predictors)
  2. Investigate the presence of potential outliers (BoxPlot)
  3. Show the correlation matrix as an image

4.1
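
A scatter-plot matrix can be sketched with plain matplotlib: off-diagonal panels show one predictor against another, and the diagonal shows each predictor's histogram. The three synthetic predictors below are an assumption. The second is built as a noisy multiple of the first so that a linear relationship is visible.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x0 = rng.normal(size=150)
X = np.column_stack([
    x0,
    2 * x0 + rng.normal(scale=0.3, size=150),   # strongly related to x0
    rng.normal(size=150),                       # unrelated
])
D = X.shape[1]

fig, axes = plt.subplots(D, D, figsize=(7, 7))
for i in range(D):
    for j in range(D):
        ax = axes[i, j]
        if i == j:
            ax.hist(X[:, i], bins=15)           # diagonal: marginal histogram
        else:
            ax.scatter(X[:, j], X[:, i], s=5)   # column j on x-axis, row i on y-axis
        if i == D - 1:
            ax.set_xlabel(f"x{j}")
        if j == 0:
            ax.set_ylabel(f"x{i}")
fig.tight_layout()
```

The grid has $D^2$ panels, so for a large number of predictors it may be more practical to plot a selected subset.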

4.2
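
A box plot flags as potential outliers the points lying more than 1.5 times the interquartile range beyond the quartiles (matplotlib's default whisker rule). The sketch below draws the box plot and applies the same rule numerically. The helper `iqr_outliers` and the toy values are illustrative assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

def iqr_outliers(x, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Toy data: the last row holds an extreme value in the first column.
X = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0],
              [4.0, 13.0], [5.0, 14.0], [50.0, 12.5]])
mask = np.column_stack([iqr_outliers(X[:, d]) for d in range(X.shape[1])])

fig, ax = plt.subplots()
ax.boxplot(X)                          # one box per column
ax.set_xticklabels(["x0", "x1"])       # hypothetical predictor names
ax.set_ylabel("value")

print("outliers per predictor:", mask.sum(axis=0))
```

If predictors have very different scales, plotting the z-scored data usually makes the boxes easier to compare.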

4.3
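
The Pearson correlation matrix can be computed with `np.corrcoef` and shown as an image with `imshow`. This is a sketch on synthetic data in which the first two predictors are deliberately correlated.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x0 = rng.normal(size=150)
X = np.column_stack([
    x0,
    2 * x0 + rng.normal(scale=0.3, size=150),   # correlated with x0
    rng.normal(size=150),                       # independent
])
D = X.shape[1]

C = np.corrcoef(X, rowvar=False)   # columns are variables, so rowvar=False

fig, ax = plt.subplots()
im = ax.imshow(C, vmin=-1, vmax=1, cmap="coolwarm")  # fixed range centres 0
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_xticks(range(D))
ax.set_yticks(range(D))
ax.set_xticklabels([f"x{d}" for d in range(D)])
ax.set_yticklabels([f"x{d}" for d in range(D)])
```

Fixing the colour range to $[-1, 1]$ with a diverging colour map keeps zero correlation visually neutral.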

5. Multi-variate analysis of the predictors

  1. Perform a principal components analysis of the predictors, retain only the first two principal components.
  2. Plot the scatter plot of the projected observations.

5.1
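
PCA can be sketched via the singular value decomposition of the centred data matrix: the right singular vectors are the principal directions, and the squared singular values are proportional to the variance explained. The function name `pca_project` is illustrative, and the input is assumed to be already z-scored if the predictors have different scales. `sklearn.decomposition.PCA` offers the same functionality.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its first principal components using SVD."""
    Xc = X - X.mean(axis=0)                        # centre each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # rows are principal directions
    scores = Xc @ components.T                     # projected observations
    explained = (S ** 2) / np.sum(S ** 2)          # variance ratio per component
    return scores, components, explained[:n_components]

rng = np.random.default_rng(4)
x0 = rng.normal(size=200)
X = np.column_stack([x0,
                     x0 + rng.normal(scale=0.2, size=200),
                     rng.normal(scale=0.5, size=200),
                     rng.normal(scale=0.1, size=200)])

scores, components, explained = pca_project(X, n_components=2)
print("explained variance ratio:", explained)
```

The signs of the principal directions are arbitrary, so a plot of the scores may appear mirrored relative to another implementation.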

5.2
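
A sketch of the scatter plot of the projected observations, coloured by class so that any separation in the first two principal components is visible. The two-class synthetic data is an assumption, and the projection is recomputed here with SVD so that the block runs on its own.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
y = np.repeat([0, 1], 100)                              # synthetic labels
X = rng.normal(loc=y[:, None] * 1.5, size=(200, 5))     # class 1 shifted by +1.5

# Project the centred data onto the first two principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T

fig, ax = plt.subplots()
for c in np.unique(y):
    ax.scatter(Z[y == c, 0], Z[y == c, 1], s=10, alpha=0.7, label=f"class {c}")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
```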